Hoist collapse shape out of scf.forall when possible and expand its destination #19044

Merged
7 commits merged into iree-org:main on Nov 13, 2024

Conversation

nirvedhmeshram
Contributor

@nirvedhmeshram nirvedhmeshram commented Nov 6, 2024

This pattern hoists a collapse_shape producer of tensor.parallel_insert_slice out of the enclosing scf.forall and expands the scf.forall destination accordingly. This is only safe because we do it on workgroup-mapped scf.forall ops, where the slices are guaranteed to be disjoint. The reason to have this pattern is that it lets us eliminate empty tensors during bufferization, which would otherwise be blocked by the collapse.
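
For illustration, here is a minimal before/after sketch of the rewrite; the shapes, the %off/%tile values, and the 1-d mapping attribute are made up for this example, and the IR produced by the actual pass may differ in detail. Before, the collapse sits inside the workgroup-mapped scf.forall:

    %0 = scf.forall (%i) in (4) shared_outs(%out = %dest) -> (tensor<64x32xf16>) {
      // %tile : tensor<16x2x16xf16> is produced by the tiled computation (elided).
      %collapsed = tensor.collapse_shape %tile [[0], [1, 2]]
          : tensor<16x2x16xf16> into tensor<16x32xf16>
      scf.forall.in_parallel {
        tensor.parallel_insert_slice %collapsed into %out[%off, 0] [16, 32] [1, 1]
            : tensor<16x32xf16> into tensor<64x32xf16>
      }
    } {mapping = [#iree_codegen.workgroup_mapping<x>]}

After the rewrite, the destination is expanded, the uncollapsed tile is inserted directly, and a single collapse_shape is applied to the scf.forall result:

    %expanded = tensor.expand_shape %dest [[0], [1, 2]] output_shape [64, 2, 16]
        : tensor<64x32xf16> into tensor<64x2x16xf16>
    %1 = scf.forall (%i) in (4) shared_outs(%out = %expanded) -> (tensor<64x2x16xf16>) {
      // %tile : tensor<16x2x16xf16> is produced by the tiled computation (elided).
      scf.forall.in_parallel {
        tensor.parallel_insert_slice %tile into %out[%off, 0, 0] [16, 2, 16] [1, 1, 1]
            : tensor<16x2x16xf16> into tensor<64x2x16xf16>
      }
    } {mapping = [#iree_codegen.workgroup_mapping<x>]}
    %res = tensor.collapse_shape %1 [[0], [1, 2]]
        : tensor<64x2x16xf16> into tensor<64x32xf16>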

@nirvedhmeshram nirvedhmeshram force-pushed the hoist_collapse_pr branch 2 times, most recently from d58c313 to e44793b on November 6, 2024 at 17:04
@nirvedhmeshram nirvedhmeshram changed the title from "[GPU] Hoist collapse shape out of scf.forall when possible and expand its destination" to "Hoist collapse shape out of scf.forall when possible and expand its destination" on Nov 7, 2024
Member

@kuhar kuhar left a comment


Just some nits

Contributor

@Max191 Max191 left a comment


So far looking good! For this first round of review, I left some comments about potential edge cases and some additional tests that would be good to have.

@nirvedhmeshram nirvedhmeshram force-pushed the hoist_collapse_pr branch 2 times, most recently from 9523904 to 48d5691 on November 11, 2024 at 22:16
@nirvedhmeshram nirvedhmeshram removed the request for review from benvanik November 11, 2024 22:16
Contributor

@Max191 Max191 left a comment


Looks good! The code looks a lot cleaner now too, nice work!

@Max191
Contributor

Max191 commented Nov 12, 2024

Hmm, looks like punet fails to compile with this. I think it must be one of the conv dispatches (because nothing else uses this pass), but it seems a bit strange to me that this is causing a compilation failure.

@nirvedhmeshram
Contributor Author

Hmm, looks like punet fails to compile with this. I think it must be one of the conv dispatches (because nothing else uses this pass), but it seems a bit strange to me that this is causing a compilation failure.

Yeah, that's a bummer. One thing I can try is to see whether CI passes when I disable multi-result support; if it does, the issue might be that the getTiedResult function assumes the multiple parallel_insert_slice ops are ordered in the same way as the results.

@Max191
Contributor

Max191 commented Nov 12, 2024

I took a look at the model, and found the dispatch that is failing. Here's a gist with a dump: https://gist.github.com/Max191/a092ec5c81e5f9f44aee19358f007a34

Looks like the problem is that the collapse_shape was hoisted out of the forall, but then could not fold with the flow.dispatch.tensor.store at the function boundary, because the store has an offset (from pad fusion). I'm not quite sure what the best solution for this is at the moment.

EDIT: snippet from the dump after reshape propagation:

    } -> (tensor<1x2x8x16x4x16xf16>, tensor<1x2x8x16x4x16xi8>)
    scf.forall.in_parallel {
      tensor.parallel_insert_slice %27#0 into %arg3[%arg0, %arg1, 0, 0, %arg2, 0] [1, 2, 8, 16, 4, 16] [1, 1, 1, 1, 1, 1] : tensor<1x2x8x16x4x16xf16> into tensor<2x128x8x16x20x16xf16>
      tensor.parallel_insert_slice %27#1 into %arg4[%arg0, %arg1, 0, 0, %arg2, 0] [1, 2, 8, 16, 4, 16] [1, 1, 1, 1, 1, 1] : tensor<1x2x8x16x4x16xi8> into tensor<2x128x8x16x20x16xi8>
    }
  } {mapping = [#iree_codegen.workgroup_mapping<z>, #iree_codegen.workgroup_mapping<y>, #iree_codegen.workgroup_mapping<x>]}
  %collapsed = tensor.collapse_shape %20#1 [[0], [1], [2, 3], [4, 5]] : tensor<2x128x8x16x20x16xi8> into tensor<2x128x128x320xi8>
  flow.dispatch.tensor.store %collapsed, %7, offsets = [0, 1, 1, 0], sizes = [2, 128, 128, 320], strides = [1, 1, 1, 1] : tensor<2x128x128x320xi8> -> !flow.dispatch.tensor<readwrite:tensor<2x130x130x320xi8>>
  flow.dispatch.tensor.store %20#0, %8, offsets = [0, 0, 0, 0, 0, 0], sizes = [2, 128, 8, 16, 20, 16], strides = [1, 1, 1, 1, 1, 1] : tensor<2x128x8x16x20x16xf16> -> !flow.dispatch.tensor<writeonly:tensor<2x128x8x16x20x16xf16>>
  return
}

@Max191
Contributor

Max191 commented Nov 12, 2024

One way to fix this would be to check that the consumer of the scf.forall is a flow.dispatch.tensor.store with a full slice (i.e., zero offsets, unit strides, and source_sizes == target_sizes). This is making the pattern even more specific, but it is already quite specific, so maybe it is okay for now.
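
To make "full slice" concrete, here is a hedged sketch (shapes made up for illustration, not taken from the failing dispatch): the first store below is a full slice that the hoisted collapse_shape can fold into, while the second has a non-zero offset (e.g. from pad fusion) into a larger target, so the collapse cannot fold and the pattern should bail out:

    // Full slice: zero offsets, unit strides, sizes equal to the target shape.
    flow.dispatch.tensor.store %collapsed0, %out0, offsets = [0, 0], sizes = [128, 320], strides = [1, 1] : tensor<128x320xf16> -> !flow.dispatch.tensor<writeonly:tensor<128x320xf16>>

    // Partial slice: non-zero offset into a padded target; not a full slice.
    flow.dispatch.tensor.store %collapsed1, %out1, offsets = [1, 0], sizes = [128, 320], strides = [1, 1] : tensor<128x320xf16> -> !flow.dispatch.tensor<readwrite:tensor<130x320xf16>>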

@nirvedhmeshram
Contributor Author

nirvedhmeshram commented Nov 12, 2024

One way to fix this would be to check that the consumer of the scf.forall is a flow.dispatch.tensor.store with a full slice (i.e., zero offsets, unit strides, and source_sizes == target_sizes). This is making the pattern even more specific, but it is already quite specific, so maybe it is okay for now.

Makes sense, I guess theoretically it should be possible to support this?

flow.dispatch.tensor.store %20#1, %7, offsets = [0, 1, 0, 1, 0, 0], sizes = [2, 128, 8, 16, 20, 16], strides = [1, 1, 1, 1, 1, 1] : tensor<2x128x8x16x20x16xf16> -> !flow.dispatch.tensor<writeonly:tensor<2x130x130x20x16xf16>>

@Max191
Contributor

Max191 commented Nov 12, 2024

Following up here from a discussion with Nirvedh offline. We are going with the solution I proposed here: #19044 (comment)

As long as the convolution has a consumer in the dispatch, the pipeline should be able to handle not hoisting the collapse_shape when the dispatch.tensor.store has offsets (i.e., when a pad consumer has been fused). If there is no consumer and the result of the convolution is directly stored with some offsets, then this will cause problems, but I don't think that case is likely to be seen. There is typically an operation between convolutions and pads, and pad fusion does not happen by default anyway. It is a case to be aware of, but I think it is fine to proceed with this additional restriction for now.

In other words, the following case would be problematic if the pad is fused with the producer convolution:

%conv = linalg.conv_2d_nhwc_hwcf ...
%pad = tensor.pad %conv ...

However, this seems like an uncommon case, and pad fusion only happens with aggressive fusion anyway.

@nirvedhmeshram nirvedhmeshram force-pushed the hoist_collapse_pr branch 2 times, most recently from fdf1b03 to ddcaeda on November 12, 2024 at 19:10
@nirvedhmeshram nirvedhmeshram merged commit f3c1467 into iree-org:main Nov 13, 2024
33 of 36 checks passed
nirvedhmeshram added a commit that referenced this pull request Nov 13, 2024
…or all users (#19139)

The existing pattern added in
#19044 created different offsets
for each user even though we had previously checked that the offsets
would be exactly the same. This prevented recursive application of the
pattern, since the offset comparison for the next application of the
pattern would fail. The change in this PR is tested by removing the cse
in the test file that was added by #19044 to
work around this exact issue.

Signed-off-by: Nirvedh Meshram <[email protected]>
Groverkss pushed a commit to Groverkss/iree that referenced this pull request Dec 1, 2024
…estination (iree-org#19044)

Groverkss pushed a commit to Groverkss/iree that referenced this pull request Dec 1, 2024
…or all users (iree-org#19139)

giacs-epic pushed a commit to giacs-epic/iree that referenced this pull request Dec 4, 2024
…estination (iree-org#19044)

giacs-epic pushed a commit to giacs-epic/iree that referenced this pull request Dec 4, 2024
…or all users (iree-org#19139)